Hyperspectral Image Classification: Potentials, Challenges, and Future Directions

Hyperspectral imagery and remote sensing have been the focus of recent advances in imaging science and technology. Current intelligent technologies, such as support vector machines, sparse representations, active learning, extreme learning machines, transfer learning, and deep learning, are typically based on machine learning. These techniques enrich the processing of such three-dimensional, multiband, high-resolution images with their precision and fidelity. This article presents an extensive survey of the contributions of machine learning and deep learning technologies to landcover classification based on hyperspectral images. The objective of this study is threefold. First, after reading a large pool of Web of Science (WoS), Scopus, SCI, and SCIE-indexed and SCIE-related articles, we provide a novel, entirely systematic approach to review work that helps in finding research gaps and developing embedded questions. Second, we emphasize contemporary advances in machine learning (ML) methods for classifying hyperspectral images, with a brief, organized overview and a thorough assessment of the literature involved. Finally, we draw conclusions to assist researchers in expanding their understanding of the relationship between machine learning and hyperspectral images for future research.


Introduction
Hyperspectral imagery is one of the most significant discoveries in remote sensing imaging science and technology. Hyperspectral imagery (HSI) combines the strengths of Geographic Information Systems (GIS) and remote sensing. HSI has applications in ecological protection, security, agriculture and horticulture, crop specification and monitoring, medical diagnosis, identification, and quantification [1]. RGB images have three dimensions: width, height, and three color bands or channels holding the red, green, and blue color information. They are stored as a 3D byte array that explicitly holds a color value for each pixel in the image, a combination of RGB intensities laid onto a color plane. In contrast, an HSI forms a hypercube with a very large number of spectral bands and hence carries an enormous amount of embedded spectral, spatial, and temporal information. This information enables various applications to detect and characterize land covers, which are the most widely explored [2]. RGB images are captured by digital RGB cameras, which can characterize objects only by their shape and color. Moreover, the embedded information is minimal since only three bands are available, all within the range of human vision. HSI, on the other hand, is captured by specialized hyperspectral sensors, that is, spectrometers, carried on aircraft or satellites. They cover a broad range of scenes by acquiring large numbers of consecutive bands that are not confined to the visible light spectrum. However, compared to a digital sensor that records light in just three wide channels, a hyperspectral sensor's channels are much narrower, which makes the spectral resolution and the data volume much higher and the data harder to store, mine, and manage [3].
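To make the contrast between the two array layouts concrete, the following minimal sketch builds an RGB array and a hyperspectral cube with NumPy. The image size, the number of bands, and the data types are hypothetical values chosen only for illustration; they are not taken from any dataset discussed in this article.

```python
import numpy as np

# Hypothetical sizes and dtypes, chosen only for illustration.
height, width = 145, 145
rgb = np.zeros((height, width, 3), dtype=np.uint8)      # three wide visible bands
hsi = np.zeros((height, width, 200), dtype=np.uint16)   # 200 narrow contiguous bands

# One pixel of the hypercube is a full spectrum (its spectral signature).
spectrum = hsi[10, 20, :]

# A common first step for pixelwise classifiers: flatten to (pixels, bands).
pixels = hsi.reshape(-1, hsi.shape[-1])

# Ratio of raw storage between the two arrays under these assumptions.
volume_ratio = hsi.nbytes / rgb.nbytes
```

Under these assumed sizes, each pixel carries a 200-value spectrum instead of three intensities, which is the source of both the richer information and the storage burden described above.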
Furthermore, processing data with such a massive number of bands poses many obstacles, such as noise, imperfect image calibration, geometric distortion, noisy labels, and limited or unbalanced labeled training samples [4][5][6]. These lead to the Hughes phenomenon and to artifacts related to dimensionality reduction: overfitting, redundancy, spectral variability, loss of significant features between the channels, etc. [7].
Classifying HSIs is considered an intrinsically nonlinear problem [8]. The initial approaches used linear-transformation-based statistical techniques: principal component methods, that is, principal component analysis (PCA) [9] and independent component analysis (ICA) [10]; discriminant analysis methods, that is, linear [11] and Fisher [12]; wavelet transforms [13]; and composite [14], probabilistic [15], and generalized [16] kernel methods. These had shown promising outcomes, but their focus was limited to spatial information. They relied on feature extraction assisted by basic classifiers, which added complexity in terms of cost, space, and time and was not sufficiently accurate. After the success of these traditional techniques for HSI classification, researchers became keenly interested in applying newer, less tedious computer-based methods that make the entire process smoother and more accurate. The last decade can be considered the fastest-growing era for computer-based technologies, owing to the emergence of machine learning (ML). ML is a powerful algorithmic tool loosely modeled on human cognition. It represents a complex system through abstraction. Hence, it can reduce complexity and look into the vast amount of HS data to extract the hidden discriminative features, both spectral and spatial [17]. Thus, it overcomes many of the stumbling blocks to achieving the desired accuracy in identifying the classes to which the objects in the target HSI data belong. Such methods can therefore act as all-in-one techniques that serve the purpose without further assistance. Keeping this in mind, we conducted an extensive survey of the various discriminative machine learning and deep learning (ML, DL) models for HSI.
In most of the literature, the HSI datasets commonly used for landcover classification are AVIRIS Indian Pines (IP), Kennedy Space Center (KSC), Salinas Valley (SV), and ROSIS-03 University of Pavia (UP), along with the less frequently used Pavia Center, Botswana, University of Houston (HU), etc. They are pre-refined and made publicly available at [18] for download and use. The motivation of our work has three parts. First, a novel methodology is proposed for the review work that is entirely systematic and helps in identifying research gaps and embedded questions after going through a large pool of research articles. Second, this work focuses on the current advancements of ML technologies for classifying HSI, with a brief, methodical description of each and a detailed review of the associated literature. Finally, the inferences are drawn to help researchers build knowledge for their future research. The key contributions of our work to the research field of hyperspectral imagery are as follows:
(1) A thorough review of the analytical and classification work carried out to date on HS imagery using ML/DL techniques.
(2) Emphasis on the categorized methods that have been explored and practiced most frequently so far, including a brief interpretation of the most recent technologies and the highlighted hybrid techniques.
(3) An open knowledge base that acts as a reservoir of relevant information, interpreting the research on each technique in terms of its methodology, convenience and limitations, and future strategies. This may help in choosing a proper objective for further research in the field of HSIs.
(4) An explicit picture of the growth of interest in the field, with a coherent specification of the benefits and drawbacks of each method individually, which informs researchers about the favorable results and the difficulties of a chosen technique.
(5) A brief account of the most recent research on HSIs, which identifies the currently adopted technologies as hot spots, along with a focus on research areas of wider interest, that is, the hybridized methods popular among researchers for addressing the problem and achieving the desired experimental results.
The rest of the article is arranged as follows: Section 2 briefly explains the constraints faced by researchers in dealing with HSI; Section 3 presents the methodology of the research along with the motive behind this review; Section 4 describes seven ML techniques, namely, support vector machine (SVM), sparse representation (SR), Markov random field (MRF), extreme learning machine (ELM), active learning (AL), deep learning (DL), and transfer learning (TL); Section 5 presents the complete summary of the literature review in the form of answers to the research questions; Section 6 gives the conclusions; and Section 7 explains the limitations and future work.

Constraints of HSI Classification
Since their emergence, hyperspectral images have posed several difficulties for analysis and processing. Initially, the field suffered from immature spectroscopy technology: hyperspectral sensors were of poor quality, and the data were poor and insufficient. With advances in applied science things have become easier, but some well-known persistent problems still need to be overcome. Some of them are as follows:
(a) Lack of high-resolution, noise-free Earth observation (EO) images: In the early days of spectrometers, the instruments were not very efficient. As a result, noise caused by water vapor, atmospheric pollutants, and other atmospheric perturbations modifies the signals coming from the Earth's surface. Several efforts have been made over the last decades to produce high-quality hyperspectral data for Earth observation and to develop a wide range of high-performance spectrometers that combine the power of digital imaging and spectroscopy and allow numerous embedded spatial-spectral features to be extracted [19].
(b) Hindrances in the extraction of features: During data gathering, redundancy across contiguous spectral bands results in duplicated information, both spatially and spectrally, which obstructs the optimal and discriminative retrieval of spatial-spectral characteristics [7].
(c) Large spatial variability and interclass similarity: The collected hyperspectral dataset contains unusable noisy bands due to acquisition errors, which cause a loss of information about the unique identity, that is, the spectral signatures, and excessive intraclass variability. Furthermore, at poor resolution each pixel covers a broad spatial region on the Earth's surface, which mixes spectral signatures and increases interclass similarity in border regions, thus creating inconsistencies and uncertainties for the classification algorithms employed [19].
(d) Limitation of available training samples and insufficient labeled data: Aerial spectrometers cover significantly smaller areas, so they can collect only a limited amount of hyperspectral data. That restricts the number of training samples for classification models [20]. In addition, HSIs typically contain classes that correspond to a single scene, and the learning procedures of available classification models require labeled data. However, labeling each pixel requires human skill and is arduous and time-consuming [21].
(e) Lack of balance among interclass samples: Class imbalance, where the number of samples varies widely from class to class, diminishes the usefulness of many existing algorithms, because improving minority class accuracy without compromising majority class accuracy is a difficult task in itself [22].
(f) Higher dimensionality: Because more information is incorporated in multiple channels, such high-band images increase estimation errors. The curse of dimensionality is a significant drawback for supervised classification algorithms, as it significantly impacts their performance and accuracy [23].
The possible solutions to the above limitations, which also correspond to the operations performed to analyze and comprehend HSIs, are (1) technological advancement toward versatile and robust spectrometer hardware that captures scenes more accurately, (2) spectral unmixing and resolution enhancement for better feature extraction and better discrimination of the embedded objects, (3) image compression and restoration together with dimensionality reduction to address the high dimensionality and the lack of data, and (4) robust classifiers that can deal with the above issues and support fast computation [7]. These hurdles were very prominent for methods that classify HSI on the basis of features extracted from it. After ML/DL came onto the scene, operations on HSI became easier because explicit feature extraction is not needed; ML/DL also has other advantages, such as good handling of noise and time complexity. However, despite many positive aspects, ML/DL has a few drawbacks [19], including parameter tuning and numerous local minima in training procedures, as well as compression [20], overfitting, optimization, and convergence problems.
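As an illustration of the dimensionality reduction mentioned in item (3), the sketch below applies PCA, one of the classical techniques cited in the Introduction, to a synthetic hypercube. The cube, its size, and the number of latent sources are assumptions made for this example; the function name `pca_reduce` is hypothetical and does not come from any reviewed work.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Project every pixel spectrum onto the top principal components."""
    h, w, bands = cube.shape
    X = cube.reshape(-1, bands).astype(np.float64)
    X -= X.mean(axis=0)                      # center each band
    # Rows of Vt are the principal directions, ordered by singular value.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)    # variance fraction per component
    reduced = X @ Vt[:n_components].T
    return reduced.reshape(h, w, n_components), explained

rng = np.random.default_rng(0)
# Synthetic cube: 5 latent sources mixed into 100 correlated bands plus noise.
latent = rng.normal(size=(20, 20, 5))
mixing = rng.normal(size=(5, 100))
cube = latent @ mixing + 0.01 * rng.normal(size=(20, 20, 100))

reduced, explained = pca_reduce(cube, n_components=5)
```

Because the synthetic bands are strongly correlated by construction, a handful of components captures almost all of the variance. Real scenes are less tidy, so the number of components to keep is a design choice that has to be validated on the data at hand.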

Research Methodology
This section is divided into three parts that will assist in understanding the review procedure and its aims.

Planning of the Review.
Three systematic steps make up the planning behind our work. First, based on their efficacy and frequency of use in classifying HSIs, seven of the most recently used ML techniques have been chosen for review in this article, which establishes their operational relationship and compatibility with the problem of categorizing the land covers of a particular scene captured as HSI. Second, this relationship reveals the shortfalls and benefits of those methods and their potential. Finally, we identify the limitations of our present review work and how to rectify them in the future.

Conducting the Review.
The entire review work has been conducted in the following steps: The studies that deal with the hyperspectral images of a particular land scene are considered, discarding medical hyperspectral imagery, water reservoirs, etc.
(iv) Design of study: The studies comprising experimental outcomes and an elaboration of the models are accepted; other literature-based articles or review papers are used only for primary knowledge gain.
(v) The language used: Only studies written in English are considered. Figure 1 represents the total number of studies screened individually in each category of the chosen ML techniques, in the form of pie charts with percentages. Figure 2 is a standard graphical depiction of the number of most recent articles that we screened for each chosen ML-based method in the period from 2015 to 2021.
(c) Selection: Out of all the papers screened on the basis of the abovementioned criteria, a few of the most eligible are handpicked. The selection has been made with specific parameters in view: the modeling strategy and algorithm, its suitability for the modern technological scenario, the final result, that is, the corresponding overall accuracy (COA) for each dataset used, and, preferably, publication in journals with a good citation index.
(d) Analysis and inference: These selected papers are thoroughly reviewed to determine their contributions, restrictions, and future propositions. Based on this analysis, deductions are drawn to show the pathway for further research.

Research Investigations (RI).
The analysis raises the following queries:

Machine Learning-Based Techniques for HSI Classification
ML technologies are not only intelligent and cognitive, but their accuracy is also rising rapidly owing to built-in abilities such as the extraction, selection, and reduction of joint spatial-spectral features as well as contextual ones [24][25][26]. Moreover, the hidden dense layers of extensive networks, each with its allocated function, work as intelligent learners by creating dictionaries or learning spaces to store deterministic information and then separating the landcover classes through their classification units [27][28][29]. The latest ML techniques that assist in classifying hyperspectral data, that is, SVM, SRC, MRF, ELM, AL, DL, and TL, are shown by category in Figure 3 and are discussed hereafter in detail.

Support Vector Machine (SVM).
SVM is a pattern-recognition technique rooted in statistical learning theory. The basic idea of SVM training is to find the optimal linear hyperplane that minimizes the expected classification error, for binary or multiclass problems [30], as depicted in Figure 4. For linearly separable binary classification, let $(x_i, y_i)$, $i = 1, \ldots, N$, be the set of training samples with $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$. The general form of the linear decision function in n-dimensional space, with classification hyperplane $f(x) = 0$, is

$$f(x) = w \cdot x + b,$$

where $w$ is the weight vector, normal to the hyperplane, and $b$ is the bias (offset) of the hyperplane. A separating hyperplane with margin $2/\|w\|$ in canonical form must satisfy the following constraints:

$$y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \ldots, N.$$

When the classes are not linearly separable, the data points are mapped to a space S of higher, possibly infinite, dimension by a mapping function ψ, for example, $\psi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$ for $x = (x_1, x_2)$. Linear operations performed in S correspond to nonlinear operations in the original input space. Let $K(x_i, x_j) = \psi(x_i)^T \psi(x_j)$ be the kernel function, which replaces the inner products of the training samples.
Constructing the SVM requires the values of the Lagrange multipliers $\alpha = (\alpha_1, \ldots, \alpha_N)$ such that

$$W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

is maximized with respect to α under the constraints

$$\sum_{i=1}^{N} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \quad i = 1, \ldots, N.$$

Because most $\alpha_i$ are equal to zero, the samples corresponding to nonzero $\alpha_i$ are the support vectors. In terms of the support vectors, the optimal classification function is

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right).$$

The application of SVM to HSI classification started two decades ago [31,32]. From work on the critical issues of applying binary SVMs [33] and on fuzzy-based SVM [34], in the form of the fuzzy input-fuzzy output support vector machine (F2-SVM), SVM evolved toward dimensionality reduction and the incorporation of morphological details [35]. It has also been combined with particle swarm optimization (PSO) [36] and with wavelet analysis and semiparametric estimation [37], as the classifier "wavelet SVM" (WSVM). Table 1 summarizes the research carried out so far on HSI classification using SVM.
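The kernel identity used above can be checked numerically. The following sketch, with hypothetical function names `psi` and `poly_kernel` and arbitrary test vectors, verifies that the inner product of the mapped points equals the homogeneous degree-2 polynomial kernel evaluated directly in the input space.

```python
import numpy as np

def psi(x):
    """Explicit feature map psi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2.0) * x1 * x2])

def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel, K(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
lhs = float(psi(x) @ psi(z))   # inner product in the mapped space S
rhs = poly_kernel(x, z)        # same value computed in the input space
```

The practical point is that the dual problem only ever needs $K(x_i, x_j)$, so the mapping ψ never has to be computed explicitly, which is what makes very high-dimensional feature spaces affordable.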

Sparse Representation and Classification (SRC).
The sparse method depends on dictionary learning, which refines the parameter values on the basis of the current training observations while retaining the knowledge accumulated from previous observations. It then generates the sparse coefficient vector using sparse coding.
This method is highly efficient because dictionary learning extracts the rich features embedded in the HSI dataset. SR can classify images pixelwise by representing the patch around each pixel as a linear combination of a few elements taken from the dictionary. The generalization of SRC called multiple SRC (mSRC) has three chief parameters: patch size, sparsity level, and dictionary size. Dictionary learning, using the K-SVD algorithm, is the first step of the sparse method. Let $Y = [y_1, y_2, \ldots, y_N]$ be a matrix of L2-normalized training samples $y_i \in \mathbb{R}^m$ [45][46][47], where each $y_i$ is a vectorized patch around a pixel. The dictionary is learned by solving

$$\min_{D, B} \|Y - DB\|_F^2 \quad \text{subject to} \quad \|b_i\|_0 \le S \ \text{for all } i,$$

where $D \in \mathbb{R}^{m \times n}$ is the learned overcomplete dictionary with $n > m$ atoms, $B = [b_1, b_2, \ldots, b_N]$ is the matrix of corresponding sparse coding vectors $b_i \in \mathbb{R}^n$, and $\|\cdot\|_F$ is the Frobenius norm. The sparsity S limits the number of nonzero coefficients in each $b_i$. The next step, sparse coding, is given the dictionary D and represents y as a linear combination $y = Db$, where b is sparse. For the final classification step, suppose that a dictionary $D_j$ is trained for each class $j \in \{1, \ldots, M\}$ of an image. Then, a new patch $y_{test}$ is classified by estimating its representation error under each class dictionary. The class assignment rule [47] is calculated through a pseudoprobability measure $P(C_j)$ obtained from each class error $E_j$. mSRC obtains the residuals of disjoint sparse representations of $y_{test}$ for all classes j. Each dictionary $D_j$ is updated by eliminating the atoms with nonzero coefficients in $b_j$ after each of the k iterations, and $y_{test}$ is assigned to a class using the results of all Q iterations. Sparse representation is an essential and efficient machine-dependent method in many areas, including denoising, restoration, target identification, recognition, and monitoring. It may become even more valuable when combined with logistic regression, adaptivity, and superpixels to extract joint features globally and locally.
SR has very high potential for being combined with methods such as PCA, ICA, Markov random fields, conditional random fields, extreme learning machines, and DL methods such as CNN and graph convolutional networks. Table 2 summarizes the research carried out so far on HSI classification using SRC.
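The sparse coding and residual-based class assignment steps can be sketched as follows. This is not the mSRC procedure of [47]: it is a plain single-pass SRC that uses a simple orthogonal matching pursuit (a swapped-in, widely used sparse coder) and assigns the class with the smallest representation error. The dictionaries are random and confined to separate subspaces purely so that the toy example is unambiguous; the names `omp` and `src_classify` are hypothetical.

```python
import numpy as np

def omp(D, y, sparsity):
    """Greedy orthogonal matching pursuit: approximate y with at most
    `sparsity` atoms (columns) of D. Assumes sparsity >= 1."""
    residual = y.copy()
    support = []
    coef = np.zeros(D.shape[1])
    for _ in range(sparsity):
        # Pick the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k not in support:
            support.append(k)
        # Re-fit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ sol
    coef[support] = sol
    return coef

def src_classify(dictionaries, y, sparsity):
    """Assign y to the class whose dictionary gives the smallest residual."""
    errors = [float(np.linalg.norm(y - D @ omp(D, y, sparsity)))
              for D in dictionaries]
    return int(np.argmin(errors)), errors

rng = np.random.default_rng(0)
m, n_atoms = 10, 8
# Toy class dictionaries living in two orthogonal 5-dimensional subspaces.
D0 = np.zeros((m, n_atoms))
D0[:5] = rng.normal(size=(5, n_atoms))
D1 = np.zeros((m, n_atoms))
D1[5:] = rng.normal(size=(5, n_atoms))
D0 /= np.linalg.norm(D0, axis=0)
D1 /= np.linalg.norm(D1, axis=0)

# A test sample built from two atoms of the class-1 dictionary.
y_test = D1[:, [1, 4]] @ np.array([0.7, -1.2])
label, errors = src_classify([D0, D1], y_test, sparsity=2)
```

In practice the dictionaries would be learned from labeled patches (for example with K-SVD) rather than drawn at random, and the residuals would feed whatever assignment rule the chosen method prescribes.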

Markov Random Field (MRF).
An MRF describes a set of random variables that satisfy the Markov property and is depicted by an undirected graph. It is similar to a Bayesian network but, unlike it, is undirected and may be cyclic. An MRF is represented as a graphical model of a joint probability distribution, as illustrated in Figure 5. The undirected graph of the MRF is $G = (V, E)$, in which V is the set of nodes representing the random variables and E is the set of edges.
Based on the Markov properties [57], the neighborhood set $N_c$ of a node c is defined as

$$N_c = \{ c' \in V : (c, c') \in E \}.$$

The conditional probability of $Y_c$, which determines the joint distribution of Y, is

$$P(Y_c \mid Y_{V \setminus \{c\}}) = P(Y_c \mid Y_{N_c}).$$

To complete the construction, the joint distribution over G is taken to be a Gibbs distribution over the maximal cliques (C) in G:

$$P(y) = \frac{1}{Z} \prod_{m \in C} \phi_m(y_m), \tag{11}$$

where Z is the partition function. Therefore, equation (11) can be rewritten as

$$P(y) = \frac{1}{Z} \exp\left( -\frac{U(y)}{T} \right),$$

where T is the temperature, whose value is generally 1, and $U(y) = \sum_{m \in C} V_m(y_m)$ represents the energy. Markov models describe a stochastic process, represented by a graph, whose key advantage is that upcoming states do not depend on the full history of past states, which suits randomly varying data such as HSIs. The variants of Markov random fields are adaptive, hierarchical, cascaded, and probabilistic, as well as blends with the Gaussian mixture model, joint sparse representation, transfer learning, etc., and their outcomes have been quite successful. Hidden Markov random fields are highly suitable for the unsupervised classification of HSIs, where the model parameters are estimated so that each pixel belongs to its appropriate cluster [58], leading to precise classification. Table 3 lists the research carried out so far on HSI classification using MRF.
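For a graph small enough to enumerate, the Gibbs form can be evaluated exactly. The sketch below assumes a pairwise Potts energy, in which every edge whose two labels disagree adds a penalty `beta`; this potential, the three-pixel chain, and the function name `gibbs_distribution` are illustrative assumptions rather than a model taken from the reviewed literature.

```python
import itertools
import numpy as np

def gibbs_distribution(edges, n_nodes, n_labels, beta=1.0, T=1.0):
    """Enumerate P(y) = exp(-U(y) / T) / Z for a tiny pairwise MRF
    with a Potts energy (assumed here for illustration)."""
    configs = list(itertools.product(range(n_labels), repeat=n_nodes))
    # U(y): each edge whose endpoints disagree adds a penalty beta.
    energy = np.array([sum(beta * (y[a] != y[b]) for a, b in edges)
                       for y in configs])
    weights = np.exp(-energy / T)
    Z = weights.sum()                     # partition function
    return configs, weights / Z, Z

edges = [(0, 1), (1, 2)]                  # a 3-pixel chain
configs, probs, Z = gibbs_distribution(edges, n_nodes=3, n_labels=2)
best = configs[int(np.argmax(probs))]     # a most probable labeling
```

Smooth labelings, in which neighbors agree, receive the lowest energy and therefore the highest probability, which is why MRF priors are used to regularize pixelwise class maps. Exhaustive enumeration is only feasible for toy graphs; real images require approximate inference.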

Extreme Learning Machine (ELM).
An efficient learning algorithm based on the single hidden layer feedforward neural network (SLFNN), it is applied to pattern classification and regression. Let $(x_j, t_j)$, $j = 1, \ldots, N$, be N training samples with inputs $x_j \in \mathbb{R}^n$ and targets $t_j \in \mathbb{R}^m$ [72]. The standard SLFNN with $\tilde{N}$ hidden nodes and activation function f(x) is modeled mathematically as

$$\sum_{i=1}^{\tilde{N}} \alpha_i f(w_i \cdot x_j + b_i) = O_j, \quad j = 1, \ldots, N.$$

Computational Intelligence and Neuroscience
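Here, $w_i = [w_{i1}, \ldots, w_{in}]^T$ is the weight vector connecting the input nodes to the $i$th hidden node, $\alpha_i = [\alpha_{i1}, \ldots, \alpha_{im}]^T$ is the weight vector connecting the $i$th hidden node to the output nodes $O_j$, $b_i$ is the bias of the $i$th hidden node, $\tilde{N}$ is the number of hidden nodes, and $w_i \cdot x_j$ denotes the inner product. The zero-error condition for the N samples can be written in matrix form as

$$A \alpha = T,$$

where A is the hidden layer output matrix of the neural network, with entries $A_{ji} = f(w_i \cdot x_j + b_i)$, and T is the matrix of target vectors $t_j$; the $i$th column of A is the output of the $i$th hidden node with respect to the inputs $x_1, \ldots, x_N$. Training the SLFNN amounts to finding specific $\hat{\alpha}$, $\hat{w}_i$, and $\hat{b}_i$ $(i = 1, \ldots, \tilde{N})$ [73] such that

$$\| A(\hat{w}_1, \ldots, \hat{w}_{\tilde{N}}, \hat{b}_1, \ldots, \hat{b}_{\tilde{N}}) \hat{\alpha} - T \| = \min_{w_i, b_i, \alpha} \| A(w_1, \ldots, w_{\tilde{N}}, b_1, \ldots, b_{\tilde{N}}) \alpha - T \|.$$

This is equivalent to minimizing the cost function

$$E = \sum_{j=1}^{N} \left\| \sum_{i=1}^{\tilde{N}} \alpha_i f(w_i \cdot x_j + b_i) - t_j \right\|^2.$$

With gradient-based algorithms, the set W of weights $(\alpha_i, w_i)$ and biases $b_i$ is adjusted over the epochs k as

$$W_k = W_{k-1} - \eta \frac{\partial E(W)}{\partial W}.$$

The learning rate η must be chosen carefully for good convergence, and $\tilde{N} \ll N$ is needed for good generalization performance.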
Extreme learning machines were proposed to overcome the disadvantages of the conventional single hidden layer feedforward neural network and to improve learning ability and generalization performance. ELM is a supervised method, but extending it to semisupervised and unsupervised versions is highly recommended for dealing with huge amounts of data such as HSIs, which are primarily unlabeled and suffer from a lack of training samples. Great potential also lies in variants of ELM other than those mentioned here [74], such as two-hidden-layer ELM, multilayer ELM, feature-mapping-based ELM, incremental ELM, and deep ELM, to achieve higher precision in classifying HSIs. Table 4 summarizes the research carried out so far on HSI classification using ELM.
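A minimal sketch of the usual ELM formulation may help here: the input weights and biases are drawn at random and left untrained, and only the output weights are obtained in closed form as the least-squares solution of the matrix equation between hidden outputs and targets. The sigmoid activation, the synthetic two-class data, the number of hidden nodes, and the function names `elm_train` and `elm_predict` are all assumptions made for illustration.

```python
import numpy as np

def hidden_output(X, W, b):
    """Hidden layer output matrix with a sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def elm_train(X, T, n_hidden, rng):
    """Draw random input weights and biases, then solve the output
    weights by least squares using the pseudoinverse."""
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = hidden_output(X, W, b)
    alpha = np.linalg.pinv(H) @ T
    return W, b, alpha

def elm_predict(X, W, b, alpha):
    return hidden_output(X, W, b) @ alpha

rng = np.random.default_rng(0)
# Two well-separated synthetic classes standing in for pixel features.
X0 = rng.normal(loc=-2.0, scale=0.5, size=(50, 2))
X1 = rng.normal(loc=2.0, scale=0.5, size=(50, 2))
X = np.vstack([X0, X1])
labels = np.array([0] * 50 + [1] * 50)
T = np.eye(2)[labels]                       # one-hot targets

W, b, alpha = elm_train(X, T, n_hidden=20, rng=rng)
pred = elm_predict(X, W, b, alpha).argmax(axis=1)
accuracy = float(np.mean(pred == labels))   # training accuracy
```

Because no iterative gradient updates are needed, training reduces to one matrix pseudoinverse, which is the main reason ELM is fast compared with gradient-trained SLFNNs. The accuracy computed here is on the training data only and says nothing about generalization.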

Active Learning (AL).
It is a special type of supervised ML approach that builds a high-performance classifier while minimizing the size of the training dataset by actively selecting valuable data points. The general structure of AL is shown in Figure 6. There are three categories of AL: stream-based selective sampling, in which each unlabeled data point is examined in turn and the learner decides whether or not to query its label; pool-based sampling, in which the whole dataset is considered before the best set of queries is selected; and membership query synthesis, which involves data augmentation to create instances for user-selected labeling. The decision to select the most informative data points depends on the uncertainty measure used in the selection. In an active learning scenario, the most informative data points are those the classifier is least sure about. The uncertainty measures for a data point x [88], with v denoting the learning model, are as follows.
Least Confidence (LC): Selects the data point for which the classifier is least certain about the chosen class. With $y^{*}$ as the most likely label sequence, LC is represented as

$$\phi^{LC}(x) = 1 - P(y^{*} \mid x; v).$$

Smallest Margin Uncertainty (SMU): Represents the difference between the classification probability of the most likely class ($y_1^{*}$) and that of the second-best class ($y_2^{*}$), written mathematically as

$$\phi^{SMU}(x) = P(y_1^{*} \mid x; v) - P(y_2^{*} \mid x; v).$$

Largest Margin Uncertainty (LMU): Represents the difference between the classification probability of the most likely class ($y_1^{*}$) and that of the least likely class ($y_{min}$), written mathematically as

$$\phi^{LMU}(x) = P(y_1^{*} \mid x; v) - P(y_{min} \mid x; v).$$

Sequence Entropy (SE): Measures the disorder in a system; higher entropy implies a more disordered condition. SE is denoted as

$$\phi^{SE}(x) = -\sum_{y} P(y \mid x; v) \log P(y \mid x; v),$$

with y ranging over all possible label sequences for input x.
Although it is not yet considered a standard approach, AL can substantially reduce the human effort, time, and processing cost needed for a large batch of unlabeled data. This method relies on prioritizing, within a huge pool of unlabeled data, the data points whose labeling will have the highest impact on training. The supervised model is trained repeatedly through active queries and keeps improving its prediction of the class of each remaining data point. AL is advantageous for its dynamic and incremental approach to training the model, so that it learns the most suitable label for each data cluster [89]. Table 5 lists the research carried out so far on HSI classification using AL.
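The four uncertainty measures can be computed directly from a matrix of predicted class probabilities. In the sketch below the probability matrix is made up for illustration, each row is treated as the class posterior of one unlabeled sample (single labels rather than label sequences), and the function name `uncertainty_scores` is hypothetical.

```python
import numpy as np

def uncertainty_scores(P):
    """P has one row per unlabeled sample and one column per class;
    every row is assumed to sum to 1."""
    ordered = np.sort(P, axis=1)[:, ::-1]             # largest probability first
    least_confidence = 1.0 - ordered[:, 0]            # large value = uncertain
    smallest_margin = ordered[:, 0] - ordered[:, 1]   # small value = uncertain
    largest_margin = ordered[:, 0] - ordered[:, -1]   # small value = uncertain
    entropy = -np.sum(P * np.log(np.clip(P, 1e-12, None)), axis=1)
    return least_confidence, smallest_margin, largest_margin, entropy

# Made-up posteriors for three unlabeled pixels and three classes.
P = np.array([[0.90, 0.05, 0.05],
              [0.40, 0.35, 0.25],
              [0.60, 0.30, 0.10]])
lc, sm, lm, ent = uncertainty_scores(P)
query = int(np.argmax(lc))   # index of the sample to send for labeling
```

In this toy case all four measures agree that the second sample is the most uncertain. On real data the measures can rank samples differently, so the choice among them is part of the query strategy.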

Deep Learning (DL).
Deep learning is the most renowned ML technology in terms of application and accuracy. Although it is considered the next step of ML, it also borrows concepts from artificial intelligence. DL comprises a family of algorithms that emulate aspects of human cognition, that is, creativity, enhanced analysis, and proper decision-making, based on pure or hybrid large networks for any given real-life problem. It has enhanced the throughput of computer-based solutions, especially for unsupervised problems, in practical technology-based applications such as automated machine translation, image reconstruction and classification, computer vision, and automated analysis [104]. The basic structure of any DL model has a three-type layered architecture: an input layer through which the input data are fed to the next layer(s), known as the intermediate hidden layer(s), responsible for all the computations required by the given problem, which pass their generated data to the final layer, that is, the output layer, which provides the desired final output. The steps involved in DL models are as follows: gaining proper knowledge and understanding of the problem, collecting the input database, selecting the most appropriate algorithm, training the model with the sample source database, and finally testing on the target database [105].
DL models are more efficient and advantageous than other ML models for the following reasons [19]:
(1) Their capability to extract hidden and complicated structures from raw data is inextricably linked to their ability to build internal representations and generalize any form of knowledge.
(2) They can accommodate a wide range of data types, for example, 2D imagery data and complex 3D data such as medical imagery and remote sensing. In addition, they can use the spectral and spatial domains of HSI data in both standalone and linked ways [106][107][108].
(3) They give architects a lot of versatility in terms of layer types, blocks, units, and depth.
(4) Their learning approach can be tailored to various learning strategies, from unsupervised to supervised, with intermediate strategies in between.
(5) Developments in processing techniques, including batch partitioning and high-performance computation, especially on distributed and parallel architectures, have enabled DL models to find better opportunities and solutions when coping with enormous volumes of data [109].
The models that are broadly used for HSI classification are described as follows.
(a) Autoencoder (AE): AEs are the fundamental unsupervised deep model based on the backpropagation rule. An AE consists of two parts: the encoder, which connects the input vector to the hidden layer through a weight matrix, and the decoder, which maps the hidden layer output to a reconstruction vector through a specific weight matrix. Stacked AEs (SAEs) are AEs with multiple hidden layers, where the output of every hidden layer is fed to the successive hidden layer as input. Training comprises three steps: (1) the first AE is trained to obtain the learned feature vector; (2) the former layer's feature vector is taken as input to the next layer, and this process is repeated until training is complete; (3) after all the hidden layers have been trained, backpropagation is used to reduce the cost function, and the weights are updated with a labeled training set for fine-tuning [110]. The architecture of SAE is depicted in Figure 7. Let $x_n \in \mathbb{R}^m$, $n = 1, 2, \ldots, N$, represent the unlabeled input dataset, $E_n$ be the hidden encoder vector computed from $x_n$, and $y_n$ be the decoder vector of the output layer [111].
Encoder: $E_n = g(W_i x_n + b_i)$; $g$: encoding function, $W_i$: encoder weight matrix, $b_i$: encoder bias vector.
Decoder: $y_n = f(W_j E_n + b_j)$; $f$: decoding function, $W_j$: decoder weight matrix, $b_j$: decoder bias vector.
AEs are unsupervised neural networks that embed several hidden layers based on nonlinear activation functions and transformations [112]. There is a high risk of information loss during training, but with specialized training an AE models specific data types well. There are AEs for many purposes, such as convolutional, sparse, variational, deep, contractive, and denoising AEs, applied to data compression, noise removal, feature extraction, image augmentation, and image coloring. AEs provide a vast platform for further research on their applicability and their capability to participate in hybridization. Table 6 describes a few research works on AEs.
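The encoder and decoder mappings, together with backpropagation on the reconstruction error, can be sketched for a single AE with one hidden layer. The sigmoid encoder, identity decoder, synthetic input data, layer sizes, learning rate, and number of epochs below are all assumptions for illustration; a stacked AE would repeat this layer by layer before fine-tuning.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, Wi, bi, Wj, bj):
    E = sigmoid(X @ Wi + bi)   # encoder: hidden code for each sample
    Y = E @ Wj + bj            # decoder with an identity decoding function
    return E, Y

def loss(X, Y):
    """Half the mean squared reconstruction error per sample."""
    return 0.5 * np.mean(np.sum((Y - X) ** 2, axis=1))

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 8))            # 50 unlabeled synthetic spectra, 8 bands
Wi, bi = 0.1 * rng.normal(size=(8, 3)), np.zeros(3)
Wj, bj = 0.1 * rng.normal(size=(3, 8)), np.zeros(8)
lr = 0.1                                 # assumed learning rate

loss_before = loss(X, forward(X, Wi, bi, Wj, bj)[1])
for _ in range(200):
    E, Y = forward(X, Wi, bi, Wj, bj)
    dY = (Y - X) / len(X)                # gradient of the loss w.r.t. Y
    dZ = (dY @ Wj.T) * E * (1.0 - E)     # backpropagate through the sigmoid
    Wj -= lr * (E.T @ dY)
    bj -= lr * dY.sum(axis=0)
    Wi -= lr * (X.T @ dZ)
    bi -= lr * dZ.sum(axis=0)
loss_after = loss(X, forward(X, Wi, bi, Wj, bj)[1])
```

The three-unit code `E` is the learned low-dimensional feature vector; in a classification pipeline it would be passed to the next layer or to a classifier instead of the raw eight-band input.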
Figure 7: The network structure of stacked autoencoders; input X-to-E is the encoding phase; E-to-output Y is the decoding phase.

Better learning levels than the random choice of data points and an entropy-based AL | Measurement of the efficacy of the active learning-based knowledge transfer approach while systematically increasing the spatial/temporal segregation of the data sources
2010 | Semi-supervised segmentation with AL and multinomial logistic regression (MLR-AL) [91] | IP
Novel approach proposed based on superpixels density metric | Development of a pixelwise solution to produce super-pixel-based neighborhoods

(b) Convolutional Neural Network (CNN): The CNN is a well-known deep neural network that works like the human visual cortex, with many interconnected layers, and is widely applied in image, speech, and signal processing. It assigns learnable weights and biases to the input image to identify various objects or patterns with differentiable features. As shown in Figure 8, each layer of a CNN possesses filtering capabilities of ascending complexity: the first layer learns to filter corners and edges; the intermediate layers learn to filter object parts; and the last layer learns to filter entire objects in different locations and shapes. The comparison between the layers in terms of several parameters is shown in Table 7. It consists of four layers [117,118]:
(2) Activation: The convolution layer produces a matrix significantly smaller than the actual image. The matrix is passed through an activation layer (generally a rectified linear unit, ReLU), which adds the nonlinearity that enables the network to train itself through backpropagation.
(3) Pooling: This is the method of further downsampling and reducing the matrix size. A filter is applied over the results obtained by the previous layer and chooses a number from each set of values (generally the maximum, the max-
(4) Fully Connected (FC): A typical multilayer perceptron structure. The input is a single-dimensional vector representing the output of the layers above. Its output is a list of probabilities for the various possible labels attached to the image. The classification decision is the label that receives the highest likelihood. Mathematically, it is represented by a transformation function g over N input samples, with X″ as the input, Y″ as the output, W as the weight matrix, and b as the bias constant.
CNN is the most in-demand and widely explored model among all DL models. The functional units of the convolutional layers are kernels, which specialize in extracting the most relevant and enriched spatial and spectral features from the given dataset through automated filtering by the convolution operation [119], where a detailed description of CNNs is provided. The most popular variants are attention-based CNN, ResNet, CapsNet, LeNet, AlexNet, VGG, etc. Some of them are still unexplored in classifying HSI. The detailed research work on CNN for HSI classification is listed in Table 8.
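The layer sequence described above (convolution, activation, pooling, fully connected) can be traced on a single-band patch with a short NumPy sketch. The patch size, the single 3x3 kernel, the 2x2 max pooling, the four output classes, and the random weights are all illustrative assumptions, not values taken from the reviewed works.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """'Valid' sliding-window filter (cross-correlation, as in most DL libraries)."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(img[r:r + kh, c:c + kw] * kernel)
    return out

def relu(a):
    return np.maximum(a, 0.0)

def max_pool(a, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    H2, W2 = a.shape[0] // size, a.shape[1] // size
    a = a[:H2 * size, :W2 * size]          # drop rows/cols that do not fill a window
    return a.reshape(H2, size, W2, size).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
patch = rng.random((9, 9))                 # synthetic single-band image patch
kernel = rng.normal(size=(3, 3))           # one learnable filter (random here)

feat = max_pool(relu(conv2d_valid(patch, kernel)))   # 9x9 -> 7x7 -> 3x3
W_fc, b_fc = rng.normal(size=(4, feat.size)), np.zeros(4)
probs = softmax(W_fc @ feat.ravel() + b_fc)          # fully connected + softmax
label = int(np.argmax(probs))                        # label with highest likelihood
```

A real HSI classifier would learn many kernels per layer, often in 3D across the spectral axis, and would train all weights by backpropagation; the sketch only follows the shape of the data through the four stages.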
(c) Recurrent Neural Network (RNN): The RNN is a very efficient DL approach that follows a sequential framework with a definite timestamp t. "Recurrent" refers to performing the same task for each sequence element, with the output depending on the preceding computations. In other words, RNNs have a "memory" that holds information about the calculation so far: the output of a particular recurrent neuron is fed back as input to the same node, which helps the network predict the output efficiently. This is represented in Figure 9, where the RNN is unrolled, that is, the complete sequence of the network structure is shown neuron by neuron. It consists of the following steps: (1) Let $X = [\ldots, x_{t-1}, x_t, x_{t+1}, \ldots]$ be the input vector, where $x_t$ represents the input at timestamp t. (2) $h_t$ is the "memory of the network," the hidden state at timestamp t. Initially, $h_{-1}$ is set to the zero vector to calculate the first hidden state. The current state $h_t$ is calculated from the previous hidden state $h_{t-1}$ and the current input $x_t$ [132] through a nonlinear function f, such as tanh or ReLU, and the weights W.
The output at timestamp t, $y_t$, is generally obtained through a softmax function. The RNN is an efficient deep model with large potential. Its recurrent loop structure enables it to store relevant information about the spatial-spectral relationships between pixels and their neighbors.
There are several RNN architectures based on inputs/outputs, as stated in [133], and, based on LSTM, there are five categories [134]. These variants can be used in collaboration with other methods such as MRF and PCA to assess their accuracy.
The literature studies based on RNN are cataloged in Table 9.
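The recurrence described for the RNN can be sketched as follows. The sketch assumes the common form in which the hidden state is tanh(U x_t + W h_{t-1}) and the output is softmax(V h_t); the separate input and output matrices U and V, the omission of bias terms, and all sizes are assumptions for illustration rather than the formulation of [132].

```python
import numpy as np

def rnn_forward(X, U, W, V):
    """Unrolled RNN: the same weights are reused at every timestamp."""
    h = np.zeros(W.shape[0])            # h_{-1} initialized to the zero vector
    outputs = []
    for x_t in X:                       # one step per sequence element
        h = np.tanh(U @ x_t + W @ h)    # new memory from input and previous memory
        z = V @ h
        e = np.exp(z - z.max())
        outputs.append(e / e.sum())     # y_t: softmax over class scores
    return np.array(outputs), h

rng = np.random.default_rng(0)
T, d_in, d_h, n_cls = 5, 1, 4, 3        # assumed sizes: 5 bands read as a sequence
X = rng.random((T, d_in))               # one pixel spectrum, band by band
U = rng.normal(size=(d_h, d_in))
W = rng.normal(size=(d_h, d_h))
V = rng.normal(size=(n_cls, d_h))

Y, h_last = rnn_forward(X, U, W, V)     # Y[t] is the class distribution at step t
```

Reading a pixel's spectrum band by band is one way to present HSI data to a recurrent model; the class prediction is typically taken from the last step, Y[-1].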
(d) Deep Belief Network (DBN): DBNs are formed by greedily stacking and training restricted Boltzmann machines (RBMs), an unsupervised learning algorithm based on "contrastive divergence." RBMs take a probabilistic approach to neural networks and are thus called stochastic neural networks. Each RBM is made of three parts: a visible unit (input layer), an invisible unit (hidden layer), and a bias unit. The general structure of a DBN is depicted in Figure 10. For a DBN, the joint distribution of the input vector X and the n hidden layers $h^n$ is defined in [137].

Figure 9: The RNN structure with recurrent neurons.

Computational Intelligence and Neuroscience
A DBN is a generative graphical model; that is, it produces all the distinct outcomes that can be generated for the particular case and learns to extract a deep hierarchical representation of the training data. DBNs are structurally more capable than RNNs, as they lack loops, are pretrained in an unsupervised way, and are computationally efficient, particularly for classification problems. Minor modifications or collaborations can improve DBNs in functionality and accuracy.

An enhanced model that utilizes the intrinsic feature provided by HS pixels with better accuracy than SVM | The study is limited to only spectral features | Incorporation of deep end-to-end convolutional RNN with both spatial-spectral features
2019 | Spectral-spatial cascaded recurrent neural network (SSCasRNN) [135] | IP-91.79%, UP-90.30% | Outruns pure RNN and CNN models due to the perfect placement of convolutional and recurrent layers to explore joint information
2020 | Geometry-aware deep RNN (Geo-DRNN) [136] | UP-98.05%, IP-97.77% | Due to encoding the complex geometrical structures, the data lack space | Minimization of memory-occupation
2021 | 2D and 3D spatial attention-driven recurrent feedback convolutional neural network (SARFNN) [28] | IP-99.15%, HU-86.05% | Integrating attention and feedback mechanism with recurrent nets in two layers, 2D and 3D, enables efficient accuracy

(e) Generative Adversarial Network (GAN): The GAN is one of the most recent DL models and is rapidly gaining ground in technical research. The GAN model is trained using two kinds of neural networks: the "generative network" or "generator," which learns to generate new viable samples, and the "discriminatory network" or "discriminator," which learns to discriminate generated instances from existing instances. Discriminative algorithms seek to classify the input data, which is given as a collection of certain features; the algorithm maps features onto labels [140].
In contrast, generative algorithms attempt to construct the input data: given a set of features, they do not classify it but attempt to create features that match a certain label. During training, the generator tries to get better at deceiving the discriminator, and the discriminator tries to catch the counterfeits produced by the generator. Thus, the training procedure is termed adversarial training. The generator and discriminator should each be trained against a static opponent, keeping the discriminator constant while training the generator and keeping the generator constant while training the discriminator. That helps to understand the gradients better.
In a GAN model, let D and G denote the discriminator and the generator, where G maps a noise data space θ to the real, original data space x. G(θ) denotes the fake output generated by G, and D(y) and D(G(θ)) are D's outputs for real and fake training samples, respectively. $P_\theta(\theta)$ and $P_d(y)$ represent the input model distribution and the original data distribution, respectively, with $\theta \sim P_\theta$ [141], as shown in Figure 11.

Combining equations (28) and (29), the total loss over the entire dataset is represented by the min-max value function.

GAN is a generative modeling neural network architecture based on adversarial training that uses a model to build new instances that could plausibly be drawn from an existing sample distribution. Hence, GANs have become a favored choice for classifying HSIs, as they compensate for the lack of data and classify the data effectively.
There are several types of GANs: conditional GAN, vanilla GAN, and deep convolutional GAN (simple types); and Pix2Pix GAN, CycleGAN, StackGAN, and InfoGAN (complex types) [142]. These may be very useful for images like HSIs, as they can deal with related issues. The research works based on GANs are listed in Table 11.
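For orientation, the min-max value function of a GAN is commonly written as below. This is the standard adversarial objective expressed in the notation of this section (D, G, θ, y, $P_\theta$, $P_d$); it is given here as an assumption about the intended form, not as a reproduction of the numbered equations in [141].

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{y \sim P_d(y)}\big[\log D(y)\big]
  + \mathbb{E}_{\theta \sim P_\theta(\theta)}\big[\log\big(1 - D(G(\theta))\big)\big]
```

The first term rewards D for assigning high probability to real samples y; the second rewards D for rejecting generated samples G(θ), while G is trained to make that second term as small as possible.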

Transfer Learning (TL).
It is currently one of the most active topics in interactive learning, and much of it remains to be explored. It is an approach in which the information gained in one or more source tasks is transferred and used to enhance the learning of a similar target task. TL is represented diagrammatically in Figure 12 and mathematically as follows: a domain D is represented as $\{X, P(X)\}$, $X = \{x_1, \ldots, x_n\}$, $x_i \in X$, where X denotes the feature space and P(X) the marginal probability of the sample data points X [149].
A task T is represented as $\{Y, P(Y|X)\} = \{Y, \Phi\}$, $Y = \{y_1, \ldots, y_n\}$, $y_i \in Y$, where Y is the label space and Φ is the predictive objective function, learned from (feature vector, label) pairs $(x_i, y_i)$, $x_i \in X$, $y_i \in Y$, and calculated as the conditional probability.
Also, for every feature vector in D, Φ predicts its corresponding label as $\Phi(x_i) = y_i$.
Let $D_S$ and $D_T$ be the source and target domains and $T_S$ and $T_T$ the source and target tasks, respectively, with $D_S \neq D_T$ and $T_S \neq T_T$. TL aims to learn $P(Y_T|X_T)$, that is, the target conditional probability distribution in $D_T$, with the knowledge obtained from $D_S$ and $T_S$.
Traditional learning is segregated and based solely on particular tasks and datasets, with different independent models working on them. No information that can be carried from one model to another is preserved. In contrast, TL possesses the human-like capability of transferring knowledge; that is, knowledge can be leveraged from previously trained models to train new models, a process that is faster, more accurate, and requires only a limited amount of training data. Table 12 presents brief details of the research works on transfer learning.
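One simple way to picture this knowledge transfer is to freeze a feature extractor obtained on the source task and train only a small classifier on the limited target data. The NumPy sketch below does this with a hypothetical frozen weight matrix W_src (random here, standing in for pretrained weights), a synthetic two-class target set, and a logistic-regression head; every name, size, and the data itself are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature extractor "pretrained" on a source task; kept frozen.
W_src = rng.normal(size=(16, 30))

def features(X):
    return np.tanh(X @ W_src.T)          # source knowledge reused as-is

# Small labeled target set: 40 pixels, 30 bands, 2 classes (synthetic).
X_t = rng.normal(size=(40, 30))
y_t = (X_t[:, 0] > 0).astype(float)

F = features(X_t)
w, b = np.zeros(F.shape[1]), 0.0         # only this new head is trained

def loss_and_grad(w, b):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
    loss = -np.mean(y_t * np.log(p + 1e-12) + (1 - y_t) * np.log(1 - p + 1e-12))
    g = p - y_t
    return loss, F.T @ g / len(y_t), g.mean()

loss_start, _, _ = loss_and_grad(w, b)
for _ in range(200):                     # plain gradient descent on the head
    _, gw, gb = loss_and_grad(w, b)
    w -= 0.1 * gw
    b -= 0.1 * gb
loss_end, _, _ = loss_and_grad(w, b)
```

Because only the head's 17 parameters are updated, training is cheap and needs few target labels, which mirrors the speed and data-efficiency argument made above. Fine-tuning some of the transferred layers is a common alternative when more target data are available.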

Discussion
Based on the reviewed articles, we can draw the desired inferences that provide answers to the investigative questions mentioned in Section 2 and show the clear motive and benefits of this review.

RI 1: What is the significance of traditional ML and DL for analyzing HSI?
Ans: Hyperspectral data have certain restrictions, as cited in Section 1. Statistical classifiers initially addressed them, but operations and analysis became much easier and more accurate after the introduction of machine-dependent ML/DL strategies [155,156]. The general advantages that ML/DL algorithms provide to researchers dealing with HSIs are as follows: (i) easy handling of high-dimensional data, that is, the troubles of the Hughes phenomenon are removed [115,125]; (ii) equal applicability to labeled and unlabeled samples [99,150]; (iii) precise and meticulous choice of features [51,127]; (iv) high-end, precise models to deal with real hypercubes, hence top-notch classification accuracy [119,154]; (v) removal of overfitting, noise, and other hurdles to a much greater extent [120,147]; (vi) embedded spatial-spectral feature extraction and selection units [119,133]; (vii) mimicking of the human brain to solve multiclass problems [136,138].

RI 2: How are ML/DL more impactful on HSI than other non-ML strategies?
Ans: Early work on hyperspectral data suffered due to its limitations. In the preliminary research stage, scientists followed the traditional methodology for classifying HSIs, that is, preprocessing (if required), extraction and selection of discriminative characteristics, and then running a classifier on those features to identify the land cover groups. Hence, they emphasized feature extraction techniques such as PCA [9], ICA [10], and wavelets [13], assisted by some basic classifiers such as extended morphological profiles [2,157], NN [158,159], logistic regression [160], edge-preserving filters [10,161], density functions/matrices [162], and the Bayes law of classification [163,164]. These classic mathematics-oriented techniques were simple in structure and design and easy to implement, but they were not enough to deal with such a huge amount of data as HSI. They also could not predict multiclass problems well enough, which is essential for a dataset like HSI, whose land covers belong to multiple classes of regions. Also, these methods were not accurate in feature selection and extraction or in dealing with the storage of such bulk data. For these reasons, researchers struggled to properly analyze, process, and classify HSIs. In contrast, the advancement of ML/DL technologies has opened a broad gateway of research that researchers are still exploring and combining in different groupings to address the HSI classification problem in real life, dealing with the limitations mentioned above [26,131]. The advantages and disadvantages of the ML and non-ML strategies applied for HSI classification are tabulated in Table 13.

RI 3: What are the advantages and challenges faced by the researchers for the chosen ML/DL-based algorithm for HSI classification?
Ans: We added the advantages and challenges of the ML- and DL-based techniques in Table 13.

RI 4: What are the emerging literary works of ML/DL on HSI classification in the year 2021?
Ans: Among recent years, 2021 appears especially promising in terms of technical advancements for the problem concerned. New techniques are emerging, along with hybrid ones, that take the solution to a new level. Recent work on MRF with a band-weighted discrete spectral mixture model (MRF-BDSMM) in a Bayesian framework has been proposed in [165], an unsupervised adaptive approach to accommodate heterogeneous noise and find the abundant labeled subpixels to extract joint features. A collaboration of kernel-based ELM (KELM) with PCA, local binary pattern (LBP), and the gray-wolf optimization algorithm (PLG) is proposed as a novel methodology. Together, they help reduce the large dimensionality, seek global and local spatial features, and optimize the KELM parameters to obtain the class labels [166]. A variant of SRC is proposed in [167], dual sparse representation graph-based collaborative propagation (DSRG-CP), which separates the spatial and spectral dimensions with their respective graphs and combines the outcomes to improve the labeling scheme for limited samples. AL remains one of the hot topics: it integrates with a Fredholm kernel regularized model (AMKFL) that enables better labeling than manual labeling, even for noisy images [168]; ties with DL through the augmentation of training samples to label the uncertain hypercubes accurately (ADL-UL) [169]; facilitates iterative training sample augmentation by expanding the hypercubes and adding discriminative joint features (ITSA-AL-SS) [170]; and extracts locally unique spatial multiscale characteristics from the super-pixels (MSAL) [171]. A novel idea of attention-based CNNs is proposed in [172,173]: the former (SSAtt-CNN) combines two attention subnetworks, spatial and spectral, with CNN as the base, and the latter (FADCNN) is a dense spectral-spatial CNN with a feedback attention technique that sets the band weights for better mining and utilization of dominant features.
GAN is one of the most exploited methods to date: [174] proposes the full utilization of shallow features from the unlabeled bands through a multitasking network (MTGAN); in [175], the discriminator is based on a capsule network and convolutional long short-term memory to extract less visible features and integrate them into high-profile contextual characteristics (CCAPS-GAN); 1D and 2D CapsGAN together form a dual-channel spectral-spatial fusion capsule GAN (DcCaps-GAN), shown in [176]; and generative adversarial minority oversampling for 3D hypercubes (3D-Hyper-GAMO) is presented in [177], which focuses on the minority class features, using existing ones to label and classify them properly.

RI 5: How are ML- and DL-based hybrid techniques helping scientists in HSI classification?
Ans: Since their emergence, HSIs have faced many hurdles in analysis and information extraction. The large number of highly correlated bands and the rich spatial-spectral feature signatures of the electromagnetic spectrum embedded in them have always been a matter of concern.
Thus, finding an appropriate technology for the classification of such interconnected, feature-dense, high-dimensional images is tedious and strenuous. The classification methods chosen so far have mostly been limited to supervised methods, which require a sufficient number of quality-labeled data, and unsupervised methods, in which the lack of coherence between the spectral clusters and the target regions prevents the desired accuracy from being obtained. To overcome such problems, a semi-supervised method is needed as a combination of supervised and unsupervised methods, named the hybrid method. A hybrid method is always advantageous in robustness and flexibility towards high-dimensional data. The hybrid methods have the following benefits:
(i) They are specifically designed to overcome the limitations and take advantage of the methodologies involved in the concerned hybrid to achieve a deep, rich, and insightful conclusion (general).
(ii) They address and resolve multiple issues in handling and analyzing HSI data at the same time, depending on the methods chosen for hybridizing [179-183].
(iii) Coherence in time, space, and cost complexities [184-186].
(iv) Better interpretability, quality, and effectiveness, leading to the construction of a more refined framework [180,182,183,187-194].
ML, being a standard versatile technology, can merge with traditional techniques like PCA to its benefit. As stated in [195,198], PCA is exploited for feature extraction, selection, and reduction to achieve higher accuracy and performance quality. PCA is considered one of the best preprocessing methods to date for improved spectral dimension reduction [180], proper selection of spectral bands and their multiscale features in a segmented format [181,199], noise-reduced spectral analysis [27], and feature extraction [130,196]. PCA in collaboration with SVM [195,200], DL for feature reduction and better classification [182,183], CNN with multiscale feature extraction [188,189], and sparse tensor technology [190] has been highly appreciated as valuable research. All these recent collaborations, along with the merging of ICA-DCT with CNN cited in [191], are evidence that, although PCA is categorized under traditional methods, it remains highly relevant for handling HSIs.
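The use of PCA for spectral dimension reduction before classification can be sketched in a few lines of NumPy. The sketch flattens a hypercube to a pixel-by-band matrix, centers it, and projects it onto the leading principal directions obtained by SVD; the cube is synthetic and the choice of 5 components is an arbitrary assumption.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Project the spectral axis of an (H, W, B) cube onto its top principal components."""
    H, W, B = cube.shape
    X = cube.reshape(-1, B)                      # one row per pixel, one column per band
    Xc = X - X.mean(axis=0)                      # center each band
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    reduced = Xc @ Vt[:n_components].T           # scores on the leading components
    explained = (s[:n_components] ** 2).sum() / (s ** 2).sum()
    return reduced.reshape(H, W, n_components), explained

rng = np.random.default_rng(0)
cube = rng.random((10, 10, 50))                  # synthetic 10x10 scene with 50 bands
reduced, explained = pca_reduce(cube, 5)         # 50 bands -> 5 components
```

The reduced cube can then be passed to any of the classifiers discussed in this section (for example, SVM or CNN). On real HSI data, whose bands are highly correlated, a few components usually retain most of the variance, whereas the random cube used here does not show that effect.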
Some other hybridizations are also explored by researchers, such as SRC with a mathematical index of divergence-correlation [192], Gabor-cube filter [193], and ELM [83,85]; ELM with CNN [86] and TL [26]; AL based on super-pixel profile [201,202], AL with CNN [203], CapsNet [204], CNN [204,205], and TL [151,184]; CNN with attention-aided methodology [172,173,185] and GAN [186]; GAN with dynamic neighborhood majority voting mechanism [194,197] and CapsNet [175,176,206,207]; and TL with MRF [70]. These articles report robust performance while mitigating the computational complexities imposed by raw HSI data, building strong, enhanced models that achieve higher accuracy.

Support vector machine
Advantages: (i) [32,41,43]; (ii) Supports supervised, semi-supervised, and unsupervised problems with less overfitting risk [24,33,37,44]; (iii) Form of a sigmoid kernel that deals better than the previous ones with unlabeled and unstructured HSI datasets [35,40-42]; (iv) Capability of solving both binary and multiclass classification problems, outperforming several methods [39]; (v) Can improve the performance if assisted with other supporting methods [36,40-42].
Disadvantages: (i) Works very well for binary classification but fails to generate accurate classes for multiclass problems [31]; (ii) Training time is high for high-class datasets like HSI [31,32]; (iii) Difficulty in fine-tuning the parameters [41,42]; (iv) Complex interpretability [33,35]; (v) Lack of easy generalization to datasets having multiple classes [33,35]; (vi) Complexity in building the model due to a lack of sufficient labeled samples [31,32].

Sparse representation and classification
Advantages: (i) A dictionary with relevant data is used for learning with a minimal number of optimal parameters [45,46]; (ii) Builds precise and powerful classification models with higher interpretability through sparse coding [49,50,54]; (iii) Proper memory usage in an optimized manner [53,55,178]; (iv) Reduces the estimated variance between the classes to produce better outcomes [49,56,178].
Disadvantages: (i) Building the dictionary incurs high overheads [50]; (ii) The dictionary or the coding might cause loss of information [48,178]; (iii) Difficulties in representing high-resolution image data like HSI through the sparse matrix [47,48].

Markov random field
Advantages: (i) Works well for a wide range of unstructured problems, with no direct dependency between the classes and the parameters [67,69]; (ii) Better denoising effect [59]; (iii) Robust for both spatial and spectral distributions [62,64]; (iv) Time complexity is low due to the graphical representation of data [63].
Disadvantages: (i) Normalization of data might be hectic for high-dimensional data [63,70]; (ii) Suffers from the lack of training undirected data that might not be possible to represent graphically [61,62]; (iii) Poor interpretability [63,68].

Extreme learning machines
Advantages: (i) Less training time and faster learning rate as compared to previous methods [86]; (ii) Avoids local minima and finishes the job in a single iteration [83,87]; (iii) Advantageous against overfitting caused by the several bands in HSIs [83]; (iv) Builds an enhanced model with better prediction performance at an optimized expense [86]; (v) Improved generalization ability, robustness, and controllability [78,84,85].
Disadvantages: (i) Higher computational hazard [76-80]; (ii) The wrong choice of the number of hidden layer neurons may cause redundancy in the model and hence affect the classification accuracy [85,86]; (iii) There is plenty of room for advancements in the algorithm to make it compatible with HSI data [78,82,86].

Active learning
Advantages: (ii) Ease in segregating the interclass and intraclass features through active query sets [91,95,102,103]; (iii) Training speed is comparatively high for not so large-scale data [103]; (iv) Knowledge-based solid models can be generated [103]; (v) Achieves greater classification accuracies for unlabeled HSIs [95,102].
Disadvantages: (i) [76-80]; (ii) The wrong choice of the number of hidden layer neurons may cause redundancy in the model and hence affect the classification accuracy [85,86]; (iii) There is plenty of room for advancements in the algorithm to make it compatible with HSI data [78,82,86].

Deep learning
Advantages: (i) Diverse, unstructured, and unlabeled raw HSI datasets are finely processed without the need for preprocessing [110,122,125,144]; (ii) Capable of addressing supervised, semi-supervised, and specifically unsupervised learning problems [127,128,137]; (iii) Expertise in dimension reduction, denoising, and feature extraction as embedded properties [27,114,124]; (iv) Effectively addresses issues such as the Hughes phenomenon, overfitting, and convergence [120,124,145]; (v) Robust and adaptive to new features introduced in the dataset [26,123,145]; (vi) The hidden layer neurons are proven to be eminent in training the desired model with highly qualified prior knowledge (DBN, RNN, CNN) [127,129,135,138]; (vii) Computational efficiency with high-performance speed (CNN, SAE) [114,115,127,128]; (viii) Data augmentation facility (GAN) [143,145].
Disadvantages: (i) Suffers from the lack of a large amount of HSI data, which is practically unavailable [123,136]; (ii) Extreme expense to generate an appropriate model by training on a complex data structure like HSIs [114,139,148]; (iii) Low interpretability [131,147]; (iv) Theoretically not sound, hence it is hard to comprehend where an error occurs and how to rectify it [122,124,145]; (v) High time and space complexity and computational hazard [131,136,148].

Transfer learning
Advantages: (i) Works as a combination of different models, be they traditional or the latest machine-learning techniques, that together bring out a highly improved hybrid model [151,152]; (ii) Capable of transferring knowledge from the source domain, that is, a pretrained model, to the target domain, that is, a new model, to make it more enriched [151,152]; (iii) Greater feature extraction and selection capability [152]; (iv) Stable model with highly optimized parameters and hyperparameters [154]; (v) High training speed and accuracy with low computational cost [26,153]; (vi) Reduced computational cost and training time complexity [153,154].
Disadvantages: (i) Data overfitting [150]; (ii) Complex structure of the model [150,151]; (iii) Less interpretability; (iv) Difficulty in implementation.

RI 6: What are the latest emerging techniques for classifying HSIs?
Ans: The following are the most recent research studies that have opened a new path for this purpose:
(i) DSVM: This latest, novel concept incorporates DL facilities into the traditional kernel SVM. It combines four deep layers of kernels, with SVMs as the hidden layer units, namely, exponential and Gaussian radial basis function (ERBF and GRBF), neural, and polynomial kernels [208]. This approach has outperformed several efficient DL methods with nearly 100% accuracy for the IP and UP datasets.
(ii) Conditional Random Fields (CRFs): These are the structured generalization of multinomial logistic regression in the form of graphical models, based on the a priori continuity assumption that neighboring pixels of analogous spectral signatures possess the same labels. They extensively explore the hidden spectral-contextual information. In [146], a CRF is incorporated with a semi-supervised GAN whose trained discriminators produce softmax predictions that are guided by dense CRF graph constraints to improve HSI classification maps. A collaboration between 3D-CNN and CRF has been proposed in [209] to make a deep CRF capable of extracting the semantic correlations between patches of hypercubes through CNN-based unary and pairwise potential functions.
A semi-supervised approach is depicted in [210], embedding subspace learning and a 3D convolutional autoencoder to remove redundancy in joint features and obtain class sets using an iterative algorithm. In [211], CRF with Gaussian edge potentials associated with deep metric learning (DML) classifies HSI data pixelwise using the geographical distances between pixels and the Euclidean distances between the features. A novel framework using an HSI feature learning network (HSINet) with CRF is proposed in [212]; it is a trainable end-to-end DL model with backpropagation that extracts joint features, edges, and colors based on subpixels, pixels, and super-pixels. In [213], a decision fusion model including CRF and MRF is built based on sparse unmixing and soft classifier outputs.
(iii) Random Forest (RF): It is an efficient algorithm that ensembles regression and classification trees. It makes the HSI classification model noise-tolerant, inherently multiclass, robust, parallelizable, and fast. In [214], RF is compared to the DL algorithm, which outshined the classification accuracy. A new framework of cascaded RF is shown in [215] that uses the boosting strategy to generate and train base classifiers and the Hierarchical Random Subspace Method to select features and suitable base classifiers based on the diversity of the features. A novel collaboration of semi-supervised learning, AL, and RF is featured in [216], where the queries based on spatial information are fed to AL, and then the labeled samples are classified by RF through semi-supervision. References [217,218] depict a deep cube CNN model that extracts pixelwise joint features, which are then classified by RF.
(iv) Graph Convolutional Network (GCN): A descendant of CNN, the GCN is a structure designed to generalize convolution to graph data. It consists of three steps: feature aggregation, feature transformation, and classification. Being adept at graphical modeling, it captures the spatial interrelations between the classes well. In [219], the different unique features collected from CNN and GCN are fused in additive, elementwise, and concatenated ways. A new framework of globally consistent GCN is introduced in [220], which first generates a spatial-spectral locally optimized graph whose global high-order neighbors obtain enriched contextual information by employing the topologically consistent connectivity of the graph; finally, those global features determine the classes. Reference [221] shows the concept of a dual GCN network that works with a limited number of training samples, where the first network extracts all the significant features and the second learns the label distribution. A novel idea of deep attention GCN is introduced in [222], based on a similarity measurement criterion that mixes a kernel-spectral angle mapper and spectral information divergence to accumulate analogous spectra. Reference [223] presents a collaboration between CNN and GCN to extract pixelwise and super-pixelwise joint features by learning small-scale regular regions and large-scale irregular regions.
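The three GCN steps named above (feature aggregation, feature transformation, classification) can be traced in a small NumPy sketch. It assumes the widely used symmetric normalization with self-loops, a four-node chain graph standing in for neighboring pixels or super-pixels, and random weights; none of these details are taken from [219-223].

```python
import numpy as np

def gcn_layer(A, X, W):
    """Aggregate neighbor features, then transform: D^-1/2 (A + I) D^-1/2 X W."""
    A_hat = A + np.eye(A.shape[0])                        # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return A_norm @ X @ W                                 # aggregation, then transformation

def softmax_rows(Z):
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Assumed toy graph: 4 nodes in a chain, each node a pixel or super-pixel.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.random((4, 6))                                    # 6 spectral features per node
W1 = rng.normal(size=(6, 5))
W2 = rng.normal(size=(5, 3))                              # 3 hypothetical land-cover classes

H = np.maximum(gcn_layer(A, X, W1), 0.0)                  # hidden layer with ReLU
probs = softmax_rows(gcn_layer(A, H, W2))                 # classification step
labels = probs.argmax(axis=1)
```

In an HSI setting, the adjacency matrix A would typically be built from spatial neighborhood or spectral similarity between pixels or super-pixels, and the weights would be learned from the labeled nodes.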

Conclusion
This article depicts the various technologies and procedures used for HSI classification from its invention to date. There are many barriers to dealing with such high-band data as HSI, as mentioned above. Despite that, many researchers have taken an interest in this field to improve the existing techniques or even invent new ones throughout the last decade. With the considerable improvement in technologies and the introduction of ML into the classification issues of HSI, classification has become more accurate than with traditional and contemporary state-of-the-art methodologies. As a result, DL has emerged as the most eminent tool for HSI classification over the last half of this decade. The more the researchers focused on this, the more they explored remote sensing and space imagery features.
This review article provides individual information on every method and its submethods regarding their performance, research gaps, and achievements. In addition, it appends a novel research methodology that makes this work more distinctive than others. After going through the minute details of each methodology, the most significant inferences have been drawn, which add further novelty to our work. It also shows a path for future researchers to choose an appropriate technique and its alternatives, hence elevating its creativity and uniqueness above other contemporary review works on this subject. Also, it provides details of the most recent research scenario on HSI classification and some of the currently developed techniques that might be highly useful in future research. Our study is unique and novel in several aspects, such as the following: (1) it includes the research works carried out in the last decade, that is, 2010-2020, and the most recent papers of the previous year, that is, 2021, as mentioned in Section 3; (2) the number of papers referred to here is above 200, outnumbering other review papers; (3) the review is carried out by selecting the most appropriate papers solely dedicated to our subject of interest, that is, machine learning techniques serving the purpose of hyperspectral image classification. Then, the findings from those works are systematically arranged in tabular format (Tables 1-12); (4) the objective behind this review work is expressed by RQ 1-6.
These also provide a clear view of the recent technological advances and applications that researchers are developing; (5) Table 14 provides an explicit idea of the pros and cons of each ML technique described in this manuscript when applied to classifying hyperspectral images, which will help researchers in their future work; and (6) a researcher who wishes to write a literature review can follow our proposed methodology, which depicts the flow of work in a methodical way [224].

Limitations of Present Work and Its Future Scope
The study has some limitations: (i) we have used fewer keywords in the current research; (ii) we only focused on seven popular ML techniques; (iii) we only briefly explain the emerging methodologies; and (iv) the experimental details are not fully discussed.
As a future proposition, we would like to explore more keywords, more techniques, and more studies that offer a better understanding of other learning methods, both traditional and contemporary. In addition, there are several instances of hybrid strategies, along with some more eminent and recent ML/DL techniques, that we look forward to exploring both qualitatively and quantitatively.